Analyze A/B Test Results

Introduction

In this project, I analyze website data from an e-commerce company. The analysis compares the company's newly launched web page with the existing (old) one, because the company wants to know which of the two pages attracts more customers.

The dataset has 294,478 rows (samples) and five columns.

Part I - Probability

To get started, let's import our libraries.

In [102]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
# Set the seed for reproducibility (note: this seeds Python's random module only, not NumPy's generator)
random.seed(42)

1. Now, read in the ab_data.csv data. Store it in df.

a. Read in the dataset and take a look at the top few rows here:

In [103]:
df = pd.read_csv('ab_data.csv')

df.head()
Out[103]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
2 661590 2017-01-11 16:55:06.154213 treatment new_page 0
3 853541 2017-01-08 18:28:03.143765 treatment new_page 0
4 864975 2017-01-21 01:52:26.210827 control old_page 1

b. The cell below shows the number of rows and columns in the dataset.

In [104]:
df.shape
Out[104]:
(294478, 5)

c. The number of unique users in the dataset.

In [105]:
df.nunique()
Out[105]:
user_id         290584
timestamp       294478
group                2
landing_page         2
converted            2
dtype: int64

d. The proportion of users converted.

In [106]:
#here 1 represents True and 0 False.
df.groupby('converted')['user_id'].count()/df.shape[0]
Out[106]:
converted
0    0.880341
1    0.119659
Name: user_id, dtype: float64
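
As a cross-check, the same proportions can be obtained with value_counts(normalize=True). The sketch below uses a small made-up Series (not the real data) purely to illustrate the call.

```python
import pandas as pd

# Hypothetical toy data, not taken from ab_data.csv
converted = pd.Series([0, 0, 0, 1], name='converted')

# normalize=True returns proportions instead of raw counts
proportions = converted.value_counts(normalize=True)
print(proportions)

# Because the column is coded 0/1, its mean is also the conversion rate
conversion_rate = converted.mean()
print(conversion_rate)
```

On the real data, the equivalent calls would be made on df['converted'].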

e. The number of times the new_page and treatment don't match.

In [107]:
match = df[((df['group'] == 'treatment') == (df['landing_page'] == 'new_page')) == False].shape[0]
match
Out[107]:
3893
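
The mismatch count above compares two boolean conditions row by row. A minimal sketch of the same idea on made-up rows (the toy frame is an assumption for illustration only) could look like this:

```python
import pandas as pd

# Hypothetical toy data, not taken from ab_data.csv
toy = pd.DataFrame({
    'group': ['control', 'treatment', 'treatment', 'control'],
    'landing_page': ['old_page', 'new_page', 'old_page', 'new_page'],
})

# A row is consistent when "is treatment" and "saw new_page" agree
in_treatment = toy['group'] == 'treatment'
saw_new_page = toy['landing_page'] == 'new_page'

# != on two boolean Series marks the rows where the two conditions disagree
mismatch_count = int((in_treatment != saw_new_page).sum())
print(mismatch_count)
```

Writing the comparison with != avoids the trailing == False and reads closer to the question being asked.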

f. Do any of the rows have missing values?

In [108]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB

There are no missing values in the dataset: every column reports 294,478 non-null entries.
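
A more direct way to count missing values is isnull().sum(). The sketch below runs it on a small made-up frame that deliberately contains one missing value, so the output is not all zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately missing value
toy = pd.DataFrame({'user_id': [1, 2, 3], 'converted': [0, np.nan, 1]})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

Applied to df, missing_per_column would be expected to show zero for every column, in line with the df.info() output.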

Data Cleaning

2. In this part:

    - Rows where the group column is treatment but the landing_page column is not new_page are dropped.

    - Rows where the group column is control but the landing_page column is not old_page are dropped.

    - Rows with a duplicated user_id are also dropped.

a. Create a new dataframe and store your new dataframe in df2.

In [109]:
df2 = df.copy()
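# Rows that are neither (treatment, new_page) nor (control, old_page)
is_treatment_new = (df2['group'] == 'treatment') & (df2['landing_page'] == 'new_page')
is_control_old = (df2['group'] == 'control') & (df2['landing_page'] == 'old_page')
index = df2[~is_treatment_new & ~is_control_old].index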
df2.drop(index , inplace=True)
#df.tail(5)
In [110]:
# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]
Out[110]:
0
In [111]:
# Get the indices of the duplicated user records
duplicated_user = df2.user_id[df2.user_id.duplicated()]

# Drop the duplicate user records
df2.drop(index=duplicated_user.index, inplace=True)
In [112]:
#Double check if duplicates are dropped
df2[df2['user_id'].duplicated()].shape[0]
Out[112]:
0
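
The duplicate removal above can also be expressed with drop_duplicates. This is a sketch on made-up rows; keep='first' is chosen here because it matches the behavior of duplicated(), which flags the second and later occurrences.

```python
import pandas as pd

# Hypothetical toy data: user 11 appears twice
toy = pd.DataFrame({
    'user_id': [10, 11, 11, 12],
    'converted': [0, 1, 0, 1],
})

# Keep the first record for each user_id and drop later repeats
deduped = toy.drop_duplicates(subset='user_id', keep='first')
print(deduped)
```

Unlike drop(..., inplace=True), this returns a new frame, so the original data stays available for comparison.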

Data Analysis

PART 1: Probability

a. How many unique user_ids are in df2?

In [113]:
df2.nunique()
Out[113]:
user_id         290584
timestamp       290584
group                2
landing_page         2
converted            2
dtype: int64

4. Use df2 in the cells below to answer the quiz questions related to Quiz 4 in the classroom.

a. What is the probability of an individual converting regardless of the page they receive?

In [115]:
prob_converted = (df2['converted'] == 1).mean()
prob_converted
Out[115]:
0.11959708724499628

b. Given that an individual was in the control group, what is the probability they converted?

In [116]:
#individuals from control group
control_ind = df2.query('group == "control"').shape[0]
#individuals who are in control group and converted
control_and_convert = df2.query('converted == 1 & group == "control"').shape[0]
#probability of converting, given the control group
prob_of_control_and_convert = control_and_convert/control_ind
print('The probability that an individual in the control group converted is: {0:.4f}'.format(prob_of_control_and_convert))
The probability that an individual in the control group converted is: 0.1204

c. Given that an individual was in the treatment group, what is the probability they converted?

In [117]:
#individuals from treatment group
treatment_ind = df2.query('group == "treatment"').shape[0]
#individuals who are in treatment group and converted
treatment_and_convert = df2.query('converted == 1 & group == "treatment"').shape[0]
#probability of converting, given the treatment group
prob_of_treatment_and_convert = treatment_and_convert/treatment_ind
print('The probability that an individual in the treatment group converted is: {0:.4f}'.format(prob_of_treatment_and_convert))
The probability that an individual in the treatment group converted is: 0.1188

d. What is the probability that an individual received the new page?

In [118]:
#individuals that received new page
new_page = df2.query('landing_page == "new_page"')
#probability of individuals who received new page
prob_new_page = new_page.shape[0]/df2.shape[0]
print('The probability that an individual received the new page is {0:.4f}'.format(prob_new_page))
The probability that an individual received the new page is 0.5001

From the observations above, the two groups convert at nearly the same rate: 12.04% for the control group and 11.88% for the treatment group. Therefore, there is no evidence that the new page leads to more conversions.
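
The per-group conversion rates can also be computed in one step with groupby. The sketch below uses made-up rows, so the rates it prints are illustrative and not the ones reported above.

```python
import pandas as pd

# Hypothetical toy data, not taken from ab_data.csv
toy = pd.DataFrame({
    'group': ['control'] * 4 + ['treatment'] * 4,
    'converted': [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column within each group is that group's conversion rate
rates = toy.groupby('group')['converted'].mean()
print(rates)
```

On df2, the same call would be expected to reproduce the control and treatment probabilities in a single Series.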

Part II - A/B Test

Null and alternative hypotheses:

We assume the old page is at least as good as the new one unless the data show the new page to be better, with a Type I error rate of 5%. The null and alternative hypotheses are therefore:

  Null hypothesis (H0): $p_{old} \geq p_{new}$
  Alternative hypothesis (H1): $p_{old} < p_{new}$

Equivalently:

  Null: $p_{new} - p_{old} \leq 0$
  Alternative: $p_{new} - p_{old} > 0$

NOTE: $p_{old}$ and $p_{new}$ are the conversion rates of the old and new pages, respectively.

a. What is the conversion rate for $p_{new}$ under the null?

In [119]:
#conversion rate of the new page (the observed treatment-group rate is used as the estimate)
p_new = prob_of_treatment_and_convert
#print the conversion rate to four decimal places
print('Conversion rate of the new page is: {0:.4f}'.format(p_new))
Conversion rate of the new page is: 0.1188

b. What is the conversion rate for $p_{old}$ under the null?

In [120]:
#conversion rate of the old page (the observed control-group rate is used as the estimate)
p_old = prob_of_control_and_convert
#print the conversion rate to four decimal places
print('Conversion rate of the old page is: {0:.4f}'.format(p_old))
Conversion rate of the old page is: 0.1204
In [121]:
#The difference in conversion rate is
print('Difference in conversion rate is {0:.4f}.'.format(p_new - p_old))
Difference in conversion rate is -0.0016.

c. What is $n_{new}$, the number of individuals in the treatment group?

In [122]:
new_ind = df2.query('landing_page == "new_page"')
n_new = len(new_ind)
print('The number of individuals in treatment group is: {}'.format(n_new))
The number of individuals in treatment group is: 145310

d. What is $n_{old}$, the number of individuals in the control group?

In [123]:
old_ind = df2.query('landing_page == "old_page"')
n_old = len(old_ind)
print('The number of individuals in control group is: {}'.format(n_old))
The number of individuals in control group is: 145274
In [124]:
#The number of individuals in the entire dataset (sample size)
sample_size = len(df2)
print('The sample size is {}'.format(sample_size))
The sample size is 290584

Simulate 10,000 draws of $p_{new}$ - $p_{old}$ to approximate the sampling distribution of the difference in conversion rates.

In [125]:
p_diffs = np.array([])

# Compute the sampling distribution
for _ in range(10000):
    # Generate elements from the new/old page groups using their probability
    new_page_converted = np.random.choice([0, 1], size = n_new, replace = True, p = [1-p_new, p_new])
    old_page_converted = np.random.choice([0, 1], size = n_old, replace = True, p = [1-p_old, p_old])
    
    # Calculate the difference in conversion rates
    p_diffs = np.append(p_diffs, new_page_converted.mean() - old_page_converted.mean())
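
The loop above draws one 0/1 value per user and grows p_diffs with np.append on every iteration, which is slow. A faster sketch, assuming only the conversion counts matter, draws each group's number of conversions directly from a binomial distribution. The seed is an assumption added for repeatability, and the rates are the rounded values printed earlier.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for repeatability only

# Group sizes and rounded conversion rates as reported earlier in this notebook
n_new, n_old = 145310, 145274
p_new, p_old = 0.1188, 0.1204
n_sims = 10_000

# Each draw is the number of conversions in one simulated group;
# dividing by the group size gives that simulation's conversion rate
new_rates = rng.binomial(n_new, p_new, size=n_sims) / n_new
old_rates = rng.binomial(n_old, p_old, size=n_sims) / n_old

# Difference in conversion rates for every simulation at once
p_diffs = new_rates - old_rates
print(p_diffs.mean(), p_diffs.std())
```

The sum of n independent 0/1 draws with probability p follows a Binomial(n, p) distribution, so this should be statistically equivalent to the loop while avoiding the repeated array copies.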
In [126]:
# Simulate the distribution under the null: normal draws centered at 0 with the standard deviation of p_diffs

p_diffs_null = np.random.normal(0, p_diffs.std(), size = sample_size)
p_diffs_null
Out[126]:
array([-0.00025746,  0.00096247,  0.00035378, ..., -0.00038651,
       -0.00012285,  0.00040741])
In [127]:
# Plot the distribution under the null along with the location of the sample mean
plt.hist(p_diffs_null, alpha=0.5)
plt.axvline(x = p_diffs.mean(), color = 'r', linestyle = '--')
plt.axvline(x = p_diffs_null.mean(), color = 'k', linestyle = '-')
plt.title('Distribution of the difference in conversion rates under the null')
plt.ylabel('Frequency')
plt.xlabel('Difference in conversion rates')
plt.show();

j. What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?

In [128]:
observed_diff = p_new - p_old

# Calculate p-value
p_value = (p_diffs_null > observed_diff).mean()
print('The probability of observing the difference in conversion rate or higher values, \n' + 
      'given that the null hypothesis is true, = {0:.2f}.'.format(p_value))
The probability of observing the difference in conversion rate or higher values, 
given that the null hypothesis is true, = 0.91.

Above, we calculated the p-value: the probability of observing a difference in conversion rates at least as large as the one in our sample, assuming the null hypothesis is true. A p-value of 0.91 is far above the 5% Type I error rate, so the observed difference is entirely consistent with the null hypothesis.

Under the null hypothesis, conversions through the old page are equal to or greater than conversions through the new page.
Therefore, we fail to reject the null hypothesis.
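
An alternative way to build the null distribution is to simulate both groups with one shared conversion rate, which is what the boundary of the null hypothesis ($p_{new} = p_{old}$) implies. The sketch below is not the method used above; it approximates the pooled rate from the rounded group rates, and the seed is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for repeatability only

# Group sizes and rounded rates as reported earlier in this notebook
n_new, n_old = 145310, 145274
p_new_obs, p_old_obs = 0.1188, 0.1204
observed_diff = p_new_obs - p_old_obs

# Under the null (taken at its boundary) both pages share one conversion rate.
# Here it is approximated by pooling the rounded group rates.
p_pooled = (p_new_obs * n_new + p_old_obs * n_old) / (n_new + n_old)

# Simulate 10,000 differences in conversion rate under the null
n_sims = 10_000
null_diffs = (rng.binomial(n_new, p_pooled, size=n_sims) / n_new
              - rng.binomial(n_old, p_pooled, size=n_sims) / n_old)

# One-sided p-value: share of null differences above the observed one
p_value = (null_diffs > observed_diff).mean()
print(round(p_value, 2))
```

With these inputs the simulated p-value should land close to the 0.91 reported above, though the exact figure depends on the random draws.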

l. We could also use a built-in function to achieve similar results. Let n_old and n_new refer to the number of rows associated with the old and new pages, respectively.

In [129]:
import statsmodels.api as sm

# old_ind and new_ind hold the old-page and new-page rows selected earlier
convert_old = len(old_ind.query('converted == 1'))
convert_new = len(new_ind.query('converted == 1'))

m. Now use stats.proportions_ztest to compute your test statistic and p-value.

In [130]:
z_score, p_value = sm.stats.proportions_ztest([convert_new, convert_old], [n_new, n_old], alternative='larger')
print('Z-score is {0:.2f} and p-value is {1:.2f}.'.format(z_score, p_value))
Z-score is -1.31 and p-value is 0.91.
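
To see where these numbers come from, the two-proportion z-test can be worked by hand. The sketch below assumes the pooled-variance form of the statistic (to my understanding the default behavior of proportions_ztest) and uses the rounded rates printed earlier, so its result differs slightly from the exact output above.

```python
import math

# Group sizes and rounded rates as reported earlier in this notebook
n_new, n_old = 145310, 145274
p_new_obs, p_old_obs = 0.1188, 0.1204

# Pooled conversion rate, approximated from the rounded group rates
p_pooled = (p_new_obs * n_new + p_old_obs * n_old) / (n_new + n_old)

# Standard error of the difference under the null (pooled variance)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_new + 1 / n_old))

# Test statistic for H1: p_new - p_old > 0
z = (p_new_obs - p_old_obs) / se

# Upper-tail probability of a standard normal via the complementary error function
p_value = 0.5 * math.erfc(z / math.sqrt(2))
print(round(z, 2), round(p_value, 2))
```

Because the inputs are rounded to four decimal places, the z-score here comes out near -1.33 instead of -1.31, while the p-value still rounds to about 0.91.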

n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages?

The test uses a Type I error rate of 5% (a 95% confidence level). Because the alternative hypothesis is one-sided ($p_{new} > p_{old}$), the null hypothesis would be rejected only if the z-score were greater than about 1.645. Our z-score is -1.31, so the null hypothesis is not rejected.

The p-value of 0.91 is well above 0.05, which likewise means we fail to reject the null hypothesis.
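
The rejection threshold for a one-sided test at the 5% level can be checked numerically. This sketch assumes scipy is available (it is a dependency of statsmodels) and plugs in the z-score reported above.

```python
from scipy.stats import norm

alpha = 0.05      # Type I error rate
z_score = -1.31   # value reported by proportions_ztest above

# Critical value for an upper-tailed test: the 95th percentile of the standard normal
critical_value = norm.ppf(1 - alpha)

# Reject the null only if the statistic falls above the critical value
reject_null = z_score > critical_value

# Upper-tail p-value that corresponds to the z-score
p_value = norm.sf(z_score)
print(round(critical_value, 3), reject_null, round(p_value, 2))
```

A two-sided test at the same level would instead use norm.ppf(1 - alpha / 2), which is where the familiar 1.96 comes from.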

Conclusion

The analysis above compared two web pages from an e-commerce company that is unsure whether to keep the old page or roll out the new one.

From these observations, the newly developed web page does not prove to be better than the old one.

We therefore suggest that the company keep the old web page.
